Using Vector Embeddings For Sentiment Analysis

Rod Acosta, Kevin Furbish, Ibrahim Khan, Anthony Washington

1. Introduction

  • Sentiment Analysis:
    • Defined as a branch of Natural Language Processing (NLP) that focuses on the computational treatment of opinions, sentiments, and subjectivity in digital text (Medhat, Hassan, and Korashy 2014).

1.1 Sentiment Analysis

  • Importance of Sentiment Analysis:
    • Essential for understanding public opinion, aiding businesses in refining marketing strategies, improving products, and enhancing customer satisfaction.
    • The rise of social media and e-commerce platforms has increased the importance of sentiment analysis for real-time consumer behavior insights.
  • Case Study: Twitter Data:
    • Hasan, Maliha, and Arifuzzaman (2019) used Twitter data to demonstrate the potential of sentiment analysis.
    • Their methodology combined Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) models with logistic regression, achieving an accuracy of 85.25% in classifying tweets as positive or negative (Hasan, Maliha, and Arifuzzaman 2019).
  • Case Study: E-commerce Reviews:
    • Kathuria, Sethi, and Negi (2022) applied sentiment analysis to e-commerce reviews using various machine learning models (logistic regression, AdaBoost, SVM, Naive Bayes, and random forest).
    • Analyzed the Women’s E-commerce Clothing Reviews dataset to understand consumer behavior and the impact of electronic word-of-mouth (eWOM) on customer attitudes and product sales (Kathuria, Sethi, and Negi 2022).
  • Granular Approach:
    • Nasukawa and Yi (2003) focused on extracting sentiments linked to specific subjects rather than entire documents.
    • Their prototype system used a syntactic parser and sentiment lexicon to detect sentiments in web pages and news articles, offering detailed insights into specific opinions (Nasukawa and Yi 2003).
  • Conclusion:
    • These studies illustrate the evolution and application of sentiment analysis using NLP, highlighting its critical role in extracting insights from vast amounts of text data for strategic decision-making.

1.2 Vector Embeddings In Natural Language Processing

  • Introduction to Embeddings:
    • Vector embeddings are crucial for encoding or describing the sentiment of words or groups of words.
    • Embeddings improve upon representing words as indices in a vocabulary by encoding relationships or similarities between words (Camacho-Collados and Pilehvar 2020).
  • Advantages of Embeddings:
    • Unlike simple vocabulary indices, embeddings can represent multiple meanings of words and their semantic and syntactic patterns (Pennington, Socher, and Manning 2014).
    • Example: In the Word2Vec model, vector arithmetic can evaluate analogies (e.g., “king” - “man” + “woman” ≈ “queen”) (Mikolov et al. 2013).
  • Advanced Embedding Models:
    • Sentence embeddings build on word embeddings to represent sentences, improving over Bag of Words models, which have high dimensionality and sparseness issues (Pilehvar and Camacho-Collados 2020).
    • Document embeddings further extend this concept to encode entire documents for NLP tasks where word ordering is important (Pilehvar and Camacho-Collados 2020).
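
The analogy arithmetic mentioned above can be sketched in a few lines of plain Python. The vectors below are hypothetical, hand-picked 3-dimensional values chosen so that the analogy works; they are not taken from Word2Vec, whose learned vectors have far more dimensions. The helper names `cosine` and `analogy` are also assumptions made for this illustration.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration only.
# Real Word2Vec vectors are learned from data, not written by hand.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With learned embeddings the result is only approximately equal to the target word's vector, which is why the nearest remaining word is returned rather than an exact match.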

1.3 Using Embeddings For Sentiment Analysis

  • Performance Improvement:
    • Word embedding and deep learning models have significantly enhanced NLP tasks, including sentiment analysis (Kasri et al. 2022).
    • Popular methods include Word2Vec and Global Vectors (GloVe), trained on different datasets (Google News and Wikipedia 2014, respectively) (N et al. 2024).
  • Challenges with Pre-trained Embeddings:
    • Pre-trained word embeddings often fail to capture contextual sentiment information, leading to inaccuracies (e.g., mapping “good” and “bad” to neighboring vectors) (Liang et al. 2023; Tang et al. 2016).
    • The paper aims to explore these challenges and investigate alternative embedding methods for more accurate sentiment analysis.

Methods

What is a neural network?

  • A neural network is a type of algorithm loosely modeled on the structure and function of the human brain. Its goal is to process and analyze data in a similar way.
  • There are different types of neural networks, but most share some common elements:
    • Artificial Neurons
    • Layers

Neural Network Layers

  • Neural networks usually have three types of layers:
    • Input Layer
    • Hidden layers
    • Output layer

What are embeddings?

  • Embeddings are a technique that allows us to map words or phrases to vectors of real numbers, where the position and direction of each vector capture the word’s semantic meaning in relation to other words.
  • They make high-dimensional data like words readable to our model and allow it to recognize and learn meaningful relationships and similarities between words.
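
As a minimal sketch of the mapping described above, an embedding can be viewed as a lookup table with one row of real numbers per vocabulary index. The vocabulary, the 2-dimensional values, and the helper name `embed` are all hypothetical and chosen only for illustration; in a trained model the rows are learned.

```python
# Hypothetical vocabulary: each word maps to an integer index.
vocabulary = {"this": 1, "is": 2, "fun": 3}

# Hypothetical 2-dimensional embedding table (made-up values).
# Row i holds the vector for the word with index i.
embedding_table = [
    [0.0, 0.0],   # index 0: assumed here to be reserved for padding
    [0.1, -0.3],  # "this"
    [0.0, 0.2],   # "is"
    [0.7, 0.6],   # "fun"
]

def embed(words):
    """Map each word to its vector by looking up its index in the table."""
    return [embedding_table[vocabulary[word]] for word in words]

vectors = embed(["this", "is", "fun"])
print(vectors)
```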

Dense Layer & Cosine Similarity

  • Cosine Similarity
    • Measures the cosine of the angle between two non-zero vectors, providing a measure of similarity.
    • The smaller the angle the higher the similarity between the two vectors.
    • \(\text{cosine\_similarity}(u,v) = \frac{u \cdot v}{\|u\| \, \|v\|}\)
  • Dense Layer
    • With a single output unit and a sigmoid activation function, the dense layer acts as a logistic regression model for binary classification.
    • It outputs the probability that the input belongs to the positive class.
    • \(y = \sigma(W \cdot z + b)\)
    • Where:
      • z is the flattened input vector.
      • W is the weight vector.
      • b is the bias term.
      • \(\sigma(x) = \frac{1}{1+e^{-x}}\) is the sigmoid function.
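
The two formulas above translate directly into code. The following is a minimal plain-Python sketch, not the implementation used in this project; the function names and the example inputs are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """(u . v) / (||u|| ||v||) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_sigmoid(z, W, b):
    """y = sigma(W . z + b) for a single output unit."""
    return sigmoid(sum(w * x for w, x in zip(W, z)) + b)

# Vectors pointing the same way have similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors have similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# A weighted sum of exactly 0 gives a probability of 0.5.
neutral = dense_sigmoid([0.5, -0.5], [1.0, 1.0], 0.0)
```

Note that cosine similarity ignores vector length: `[1, 2]` and `[2, 4]` score as identical because only the angle between them matters.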

Sentiment Analysis

  • Through the use of a neural network with embedding and dense layers, together with cosine similarity, we are able to take inputs and classify them as positive or negative based on what our model has learned from the training dataset.

Analysis and Results

Dataset Description

  • Data set of 25,000 movie reviews from IMDB
  • Max 30 reviews for each movie since popular movies are rated more often than unpopular movies.
  • Includes only the top 5,000 most frequent words, minus the top 50 most frequent words
  • IMDB 1-10 star ratings converted to a binary label (0 = negative, 1 = positive)
  • Reviews already in vectorized format

Vectorization

  • Neural Networks require numeric inputs, not the natural language of reviews
  • Vectorization replaces each word with a unique integer index. For example, the following texts:
    • “this is fun”, “fun times ahead”, “fun is ahead of times”
  • Results in a vocabulary of [this, is, fun, times, ahead, of]
  • Vectorized observations: [1, 2, 3], [3, 4, 5] and [3, 2, 5, 6, 4]
  • TensorFlow includes the IMDB dataset in a vectorized format
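
The toy example above can be reproduced with a short plain-Python sketch. The function names `build_vocabulary` and `vectorize` are hypothetical, and indices are assigned in order of first appearance to match the example. The IMDB dataset shipped with TensorFlow is, as far as we know, indexed by word frequency with a few low indices reserved for special tokens, so this mirrors only the example and not that exact encoding.

```python
def build_vocabulary(texts):
    """Assign each new word the next integer index, starting at 1."""
    vocabulary = {}
    for text in texts:
        for word in text.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary

def vectorize(text, vocabulary):
    """Replace every word in the text with its vocabulary index."""
    return [vocabulary[word] for word in text.split()]

texts = ["this is fun", "fun times ahead", "fun is ahead of times"]
vocabulary = build_vocabulary(texts)
vectorized = [vectorize(text, vocabulary) for text in texts]

print(list(vocabulary))  # ['this', 'is', 'fun', 'times', 'ahead', 'of']
print(vectorized)        # [[1, 2, 3], [3, 4, 5], [3, 2, 5, 6, 4]]
```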

Statistical Modeling

Neural Network Implementation

  • Neural network model implemented and trained using TF’s Keras
  • TF and Keras implemented in Python and accessed from R via Reticulate
  • Model learns embeddings for each word in the vocabulary
  • Each multi-dimensional embedding vector is placed close, in the learned vector space, to the vectors of similar words
  • “Similar” means having a similar contextual meaning in the training dataset and a similar sentiment classification
    • “gem” and “favorite” would be similar in context of a movie review, but not in a general context

Neural Network Implementation

  • Keras Sequential model (model is built layer-by-layer)
  • The neural network leverages the Keras embedding layer
    • Converts each input vocabulary index into a vector of a chosen dimension
    • Dimensionality is an important hyperparameter that controls compression vs. overfitting
  • Embedding layer outputs one vector per input word, with length equal to the dimensionality hyperparameter
  • Embedding layer connects to a dense layer
  • Dense layers require a 1D input vector so embedding layer output is flattened to 1D
  • Dense layer has 1 output unit with a sigmoid activation function that receives the flattened embedding output
  • Output unit is the sentiment score
  • Back-propagation trains the embedding layer weights so that words with similar sentiment receive similar vectors
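
The data flow described above (embedding lookup, flatten, single sigmoid unit) can be sketched without any framework. This is an illustrative forward pass only, not the Keras model itself, and it omits training. The function name `forward`, the parameter values, and the 3-word input length are all hypothetical.

```python
import math

def forward(word_indices, embedding_table, dense_weights, dense_bias):
    """Compute a sentiment score in (0, 1) for one vectorized review."""
    # Embedding layer: look up one vector per word index.
    embedded = [embedding_table[i] for i in word_indices]
    # Flatten: concatenate the per-word vectors into a single 1D vector.
    flat = [value for vector in embedded for value in vector]
    # Dense layer: one output unit with sigmoid activation.
    logit = sum(w * x for w, x in zip(dense_weights, flat)) + dense_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical untrained parameters: 4 vocabulary entries, 2 dimensions.
embedding_table = [[0.0, 0.0], [0.5, -0.5], [1.0, 1.0], [-1.0, 0.0]]
# The dense layer needs one weight per flattened value:
# 3 words per review x 2 dimensions = 6 weights.
dense_weights = [0.5] * 6
dense_bias = 0.0

score = forward([1, 2, 3], embedding_table, dense_weights, dense_bias)
print(score)
```

Because the dense layer has one weight per flattened value, every review must have the same length, which is why input sequences are padded or truncated to a fixed number of words before reaching the model.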

Determining Vector Dimensionality

  • Number of dimensions for the embeddings must be selected
  • Dimensionality selection often done ad hoc, or with grid search
  • Embeddings of 2 through 7 dimensions were tested and compared by test accuracy and by qualitative inspection of the closest words

Model Performance: 2D

  • 2 dimensions had best accuracy, but failed to capture meaning
  • Closest embeddings to the embedding for “awful”:

    word        cosine similarity
    awful       1.0
    lame        1.0
    alcoholic   1.0
    sadly       0.99
    relevant    0.99
    are         0.99

Model Performance: 2D vs 7D

  • 7 dimensions had accuracy similar to 2 dimensions, but produced more meaningful embeddings

Table 1: Closest Words By Dimensionality

(a) 2 Dimensions

    reference word   closest words
    awful            lame, alcoholic, sadly, relevant, are
    mediocre         effort, turkey, terrible, stereotype, repeat
    perfect          lovers, sing, manager, bath, donald
    favorite         deeply, roud, marie, polanski, poetry

(b) 7 Dimensions

    reference word   closest words
    awful            ultimately, painful, sorry, fake, nowhere
    mediocre         teeth, incompetent, main, disappointing, generous
    perfect          great, seeking, freedom, tremendous, excellent
    favorite         paulie, excellent, necessary, great, seeking
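
Closest-word lists like those in Table 1 can be produced by ranking every other word by cosine similarity to the reference word. The sketch below shows one way to do this; the function name `closest_words` and the 2-dimensional toy embeddings are hypothetical and are not the learned values from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_words(reference, embeddings, k=5):
    """Return the k words most similar to the reference, best first."""
    ref_vector = embeddings[reference]
    scored = [
        (word, cosine_similarity(ref_vector, vector))
        for word, vector in embeddings.items()
        if word != reference
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional embeddings (made-up values).
embeddings = {
    "awful": [1.0, 0.1],
    "lame":  [2.0, 0.2],
    "great": [-1.0, 0.0],
    "film":  [0.0, 1.0],
}

neighbors = closest_words("awful", embeddings, k=2)
print([word for word, _ in neighbors])
```

One plausible reading of the 2-dimensional results is that vectors in a plane differ only by a single angle, so many unrelated words end up pointing in nearly the same direction and score close to 1.0.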

Data and Visualization

Conclusion

The development and analysis of the word embedding model for classifying IMDB movie reviews demonstrated promising results. The number of embedding dimensions selected was 7, achieving an accuracy of 87.34% on the test dataset. This choice came from experiments with 2 through 7 dimensions, which showed that 7 dimensions gave competitive and consistent accuracy while producing more meaningful embeddings. The model’s performance is noteworthy given the constraints of training on only the top 5,000 most common words, minimal data preprocessing, and limiting input sequences to the first 500 words of each review. These factors illustrate the model’s robustness and effectiveness in capturing the semantic relationships within the data.

Furthermore, the embedding similarity results showed that the model could meaningfully capture semantic relationships, as evidenced by the coherent and relevant similar words found for terms like “awful,” “mediocre,” “perfect,” and “favorite.” Capping the final training session at 10 epochs helped limit overfitting while maintaining the model’s accuracy and reliability. Overall, the model’s strong performance under constrained conditions highlights its potential for practical applications in sentiment analysis, offering an efficient and effective solution for understanding and categorizing movie reviews.

Camacho-Collados, Jose, and Mohammad Taher Pilehvar. 2020. “Embeddings in Natural Language Processing.” In Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, edited by Lucia Specia and Daniel Beck, 10–15. Barcelona, Spain (Online): International Committee for Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-tutorials.2.
Hasan, Md. Rakibul, Maisha Maliha, and M. Arifuzzaman. 2019. “Sentiment Analysis with NLP on Twitter Data.” In 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 1–4. https://doi.org/10.1109/IC4ME247184.2019.9036670.
Kasri, Mohammed, Marouane Birjali, Mohamed Nabil, Abderrahim Beni-Hssane, Anas El-Ansari, and Mohamed El Fissaoui. 2022. “Refining Word Embeddings with Sentiment Information for Sentiment Analysis.” Journal of ICT Standardization 10 (3): 353–82. https://doi.org/10.13052/jicts2245-800X.1031.
Kathuria, Priyanshi, Parth Sethi, and Rithwick Negi. 2022. “Sentiment Analysis on e-Commerce Reviews and Ratings Using ML & NLP Models to Understand Consumer Behavior.” In 2022 International Conference on Recent Trends in Microelectronics, Automation, Computing and Communications Systems (ICMACC), 1–5. https://doi.org/10.1109/ICMACC54824.2022.10093674.
Liang, Bin, Rongdi Yin, Jiachen Du, Lin Gui, Yulan He, Min Yang, and Ruifeng Xu. 2023. “Embedding Refinement Framework for Targeted Aspect-Based Sentiment Analysis.” IEEE Transactions on Affective Computing 14 (1): 279–93. https://doi.org/10.1109/TAFFC.2021.3071388.
Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. 2014. “Sentiment Analysis Algorithms and Applications: A Survey.” Ain Shams Engineering Journal 5 (4): 1093–113. https://doi.org/10.1016/j.asej.2014.04.011.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781.
N, Lavanya B., Anitha Rathnam K. V, Kiran K, P. Deepa Shenoy, and Venugopal K. R. 2024. “Fusion of Deep Learning with Advanced and Traditional Embeddings in Sentiment Analysis.” In 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), 1–6. https://doi.org/10.1109/I2CT61223.2024.10543279.
Nasukawa, Tetsuya, and Jeonghee Yi. 2003. “Sentiment Analysis: Capturing Favorability Using Natural Language Processing.” In Proceedings of the 2nd International Conference on Knowledge Capture, 70–77.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
Pilehvar, Mohammad Taher, and Jose Camacho-Collados. 2020. Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning. Morgan & Claypool Publishers.
Tang, Duyu, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016. “Sentiment Embeddings with Applications to Sentiment Analysis.” IEEE Transactions on Knowledge and Data Engineering 28 (2): 496–509. https://doi.org/10.1109/TKDE.2015.2489653.